I am feeling excited. I expect to learn basically R in this course. I received an email about this course from university of Helsinki. Here is my forked repository:
GitHub repository : https://github.com/chamalee/IODS-project
This week I have covered data wrangling and linear regression analysis.
date()
## [1] "Sat Nov 27 18:32:46 2021"
learning2014 <- read.csv(file = "data/learning2014.csv", header = TRUE)
str(learning2014)
## 'data.frame': 166 obs. of 8 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ gender : chr "F" "M" "F" "M" ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
dim(learning2014)
## [1] 166 8
This dataset contains data of 183 participants from a course which was collected in 2014. This dataset has 7 columns as follows.
Background of measures:
Measures A and C are based on parts A and C in ASSIST (Approaches and Study Skills Inventory for Students) http://www.etl.tla.ed.ac.uk/publications.html#measurement and D is based on SATS (Survey of Attitudes Toward Statistics) http://www.evaluationandstatistics.com/
The items of the central measure in this study (ASSIST B) are named so that the connections to the corresponding dimensions (Deep/SUrface/STrategic) can be easily seen.
# access the GGally and ggplot2 libraries
library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(ggplot2)
# create a more advanced plot matrix with ggpairs() : draws all possible scatter plots from the columns of a data frame, resulting in a scatter plot matrix.
p <- ggpairs(learning2014, mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))
# draw the plot
p
From above graphical overview, we can see the relationships between the variables. It shows the correlations between different variables and distributions of each variables with respect to ‘gender’ variable.
The colour of plots is defined with the ‘gender’ variable. Pink color plots correspond to Female and green color for Male.
In this graphical overview, we can see the correlations between all variables. We have a negative correlation between points and age. The trend is that when age increases points decreases. And there is a positive correlation between points are attitudes. The trend is that when attitude increases point increases. If we consider correlation between surf variable and deep variable, there is a strong correlation in males where as a weak correlation in females. However, we can also observe that there are approximately twice number of more female than male.
# a scatter plot of points versus attitude
library(ggplot2)
qplot(attitude, points, data = learning2014) + geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'
# fit a linear model # create a regression model with multiple explanatory variables
my_model <- lm(points ~ attitude + stra + deep, data = learning2014)
# print out a summary of the model
summary(my_model)
##
## Call:
## lm(formula = points ~ attitude + stra + deep, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.5239 -3.4276 0.5474 3.8220 11.5112
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.3915 3.4077 3.343 0.00103 **
## attitude 3.5254 0.5683 6.203 4.44e-09 ***
## stra 0.9621 0.5367 1.793 0.07489 .
## deep -0.7492 0.7507 -0.998 0.31974
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.289 on 162 degrees of freedom
## Multiple R-squared: 0.2097, Adjusted R-squared: 0.195
## F-statistic: 14.33 on 3 and 162 DF, p-value: 2.521e-08
Linear model: formula = points ~ attitude + stra + deep
The best model is found by minimizing the prediction errors (residuals) that model would make. Goal is to minimize the sum of squared residuals. The parameters of residuals are as follows.
Parameters of Residuals:
Min -17.5239, 1Q -3.4276, Median 0.5474, 3Q 3.8220, Max 11.5112.
Coefficients give the estimates of the model parameters corresponding to each variable along with the intercept. Intercept corresponds to the value in y-axis where linear model intersects y-axis.
Coefficients:
(Intercept) 11.3915
attitude 3.5254
stra 0.9621
deep -0.7492
P-values:
The P-value which corresponds to attitude is very small. Therefore, we can conclude there is a strong correlation between the points variable and attitude variable. Due to high p-value of deep variable, there is the least correlation between points variable and deep variable. P-values are as follows.
attitude 4.44e-09 *** stra 0.07489 .
deep 0.31974
We can even observe standard error values of parameter estimates from Std. Error column in the summary - table of coefficients.
Here, it was considered, points as the target variable and attitude, stra and deep as the explanatory variables. It implies we are trying to estimate points variable as a multiple linear model of attitude, stra and deep variables. We can see that there is a statistically significant relationship between exam points and attitude, but not with deep. This means that deep does not interpret anything about exam points, but the attitude do interpret. Therefore, deep variable was removed from the consideration and remodelled only with attitude and stra variables.
With attitude + stra + deep variables -> Multiple R-squared: 0.2097 -> approximately 20% With attitude + stra variables -> Multiple R-squared: 0.2048 -> approximately 20%
Multiple R-squared implies the multiple correlation coefficient between three or more variables. It implies how strong the linear relationship is. In other words, how well the regression model fits the observed data. This value ranges from 0 to 1. For example, an R-squared of 20% reveals that 20% of the data fit the regression model. Generally, a higher R-squared indicates a better fit for the model. In this case, we can see after we remove deep variable, Multiple R-squared approximately remains unchanged. That is because deep variable does not have a high impact towards point variable.
# a scatter plot of points versus attitude
# create a regression model with multiple explanatory variables
my_model2 <- lm(points ~ attitude + stra, data = learning2014)
# draw diagnostic plots using the plot() function. Choose the plots 1, 2 and 5
par(mfrow = c(2,2))
plot(my_model2, which = c(1,2,5))
Model assumptions in linear regression are as follows.
Analyzing the residuals of the model provides a method to explore the validity of model assumptions.
Assumptions related to errors are as follows.
Residuals vs Fitted-plot provides a method to explore the assumption that the size of a given error does not depend on the explanatory variables. Any pattern will imply a problem in the assumption. As we can see in above Residuals vs Fitted-plot, there is a reasonable random spread of points. We can not observe a pattern; but a constant variance. It confirms the assumption that the size of the errors does not depend on the explanatory variables.
Normal QQ-plot provides a method to explore the assumption that errors of the model are normally distributed. As we can see in above QQ-plot, there is a reasonable fit in the middle. But in the corners, there is a clear deviation which makes the normality assumption being questionable.
The Residuals vs leverage shows how much impact single point has on the model. It helps to identify which points have an unusually high impact. As we can see in above Residuals vs leverage plot, there are no highly unusual outliers.
This week I have covered data wrangling and logistic regression analysis.
date()
## [1] "Sat Nov 27 18:32:55 2021"
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# read the data into memory
alc <- read.table("https://github.com/rsund/IODS-project/raw/master/data/alc.csv", sep=",", header=TRUE)
dim(alc)
## [1] 370 51
glimpse(alc)
## Rows: 370
## Columns: 51
## $ school <chr> "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP",…
## $ sex <chr> "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F",…
## $ age <int> 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15,…
## $ address <chr> "R", "R", "R", "R", "R", "R", "R", "R", "R", "U", "U", "U",…
## $ famsize <chr> "GT3", "GT3", "GT3", "GT3", "GT3", "GT3", "GT3", "LE3", "LE…
## $ Pstatus <chr> "T", "T", "T", "T", "T", "T", "T", "T", "T", "A", "A", "T",…
## $ Medu <int> 1, 1, 2, 2, 3, 3, 3, 2, 3, 3, 4, 1, 1, 1, 1, 1, 2, 2, 2, 3,…
## $ Fedu <int> 1, 1, 2, 4, 3, 4, 4, 2, 1, 3, 3, 1, 1, 1, 2, 2, 1, 2, 3, 2,…
## $ Mjob <chr> "at_home", "other", "at_home", "services", "services", "ser…
## $ Fjob <chr> "other", "other", "other", "health", "services", "health", …
## $ reason <chr> "home", "reputation", "reputation", "course", "reputation",…
## $ guardian <chr> "mother", "mother", "mother", "mother", "other", "mother", …
## $ traveltime <int> 2, 1, 1, 1, 2, 1, 2, 2, 2, 1, 1, 3, 1, 1, 1, 1, 3, 1, 2, 1,…
## $ studytime <int> 4, 2, 1, 3, 3, 3, 3, 2, 4, 4, 2, 1, 2, 2, 2, 2, 3, 4, 1, 2,…
## $ schoolsup <chr> "yes", "yes", "yes", "yes", "no", "yes", "no", "yes", "no",…
## $ famsup <chr> "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", "ye…
## $ activities <chr> "yes", "no", "yes", "yes", "yes", "yes", "no", "no", "no", …
## $ nursery <chr> "yes", "no", "yes", "yes", "yes", "yes", "yes", "yes", "no"…
## $ higher <chr> "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", "ye…
## $ internet <chr> "yes", "yes", "no", "yes", "yes", "yes", "yes", "yes", "yes…
## $ romantic <chr> "no", "yes", "no", "no", "yes", "no", "yes", "no", "no", "n…
## $ famrel <int> 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 3, 5, 5, 3, 3,…
## $ freetime <int> 1, 3, 3, 3, 2, 3, 2, 1, 4, 3, 3, 3, 3, 4, 3, 2, 2, 1, 5, 3,…
## $ goout <int> 2, 4, 1, 2, 1, 2, 2, 3, 2, 3, 2, 3, 2, 2, 2, 3, 2, 2, 1, 2,…
## $ Dalc <int> 1, 2, 1, 1, 2, 1, 2, 1, 2, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1,…
## $ Walc <int> 1, 4, 1, 1, 3, 1, 2, 3, 3, 1, 1, 2, 3, 2, 1, 2, 1, 1, 1, 1,…
## $ health <int> 1, 5, 2, 5, 3, 5, 5, 4, 3, 4, 1, 4, 4, 5, 5, 1, 4, 3, 5, 3,…
## $ n <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,…
## $ id.p <int> 1096, 1073, 1040, 1025, 1166, 1039, 1131, 1069, 1070, 1106,…
## $ id.m <int> 2096, 2073, 2040, 2025, 2153, 2039, 2131, 2069, 2070, 2106,…
## $ failures <int> 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2,…
## $ paid <chr> "yes", "no", "no", "no", "yes", "no", "no", "no", "no", "no…
## $ absences <int> 3, 2, 8, 2, 5, 2, 0, 1, 9, 10, 0, 3, 2, 0, 4, 1, 2, 6, 2, 1…
## $ G1 <int> 10, 10, 14, 10, 12, 12, 11, 10, 16, 10, 14, 10, 11, 10, 12,…
## $ G2 <int> 12, 8, 13, 10, 12, 12, 6, 10, 16, 10, 14, 6, 11, 12, 12, 14…
## $ G3 <int> 12, 8, 12, 9, 12, 12, 6, 10, 16, 10, 15, 6, 11, 12, 12, 14,…
## $ failures.p <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,…
## $ paid.p <chr> "yes", "no", "no", "no", "yes", "no", "no", "no", "no", "no…
## $ absences.p <int> 4, 2, 8, 2, 2, 2, 0, 0, 6, 10, 0, 6, 2, 0, 6, 0, 0, 4, 4, 2…
## $ G1.p <int> 13, 13, 14, 10, 13, 11, 10, 11, 15, 10, 15, 11, 13, 12, 13,…
## $ G2.p <int> 13, 11, 13, 11, 13, 12, 11, 10, 15, 10, 14, 12, 12, 12, 12,…
## $ G3.p <int> 13, 11, 12, 10, 13, 12, 12, 11, 15, 10, 15, 13, 12, 12, 13,…
## $ failures.m <int> 1, 2, 0, 0, 2, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3,…
## $ paid.m <chr> "yes", "no", "yes", "yes", "yes", "yes", "no", "yes", "no",…
## $ absences.m <int> 2, 2, 8, 2, 8, 2, 0, 2, 12, 10, 0, 0, 2, 0, 2, 2, 4, 8, 0, …
## $ G1.m <int> 7, 8, 14, 10, 10, 12, 12, 8, 16, 10, 14, 8, 9, 8, 10, 16, 1…
## $ G2.m <int> 10, 6, 13, 9, 10, 12, 0, 9, 16, 11, 15, 0, 10, 11, 11, 15, …
## $ G3.m <int> 10, 5, 13, 8, 10, 11, 0, 8, 16, 11, 15, 0, 10, 11, 11, 15, …
## $ alc_use <dbl> 1.0, 3.0, 1.0, 1.0, 2.5, 1.0, 2.0, 2.0, 2.5, 1.0, 1.0, 1.5,…
## $ high_use <lgl> FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE,…
## $ cid <int> 3001, 3002, 3003, 3004, 3005, 3006, 3007, 3008, 3009, 3010,…
# summary(alc)
This dataset contains data of 370 participants in secondary education of two Portuguese schools. This dataset has 51 columns. The data attributes include student grades, demographic, social and school related features) and it was collected by using school reports and questionnaires. This combined datasets are provided regarding the performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). In [Cortez and Silva, 2008].
The purpose of the analysis is to study the relationships between high/low alcohol consumption and some of the other variables in the data. 4 interesting variables : - sex : I think female student might consume less amount of alcohol with compared to male students. - age : I think higher the age, the highest consumption. - romantic : I expect that having a committed relationship will make the least alcohol consumption. - freetime : I expect that more free time will allow the students for high alcohol consumption. In my point of view, high_use variable will behave approximately similar to alc_use.
# access the GGally and ggplot2 libraries
library(GGally)
library(ggplot2)
library(dplyr)
alc_questions <- c("alc_use", "high_use", "sex", "age", "romantic", "freetime")
# alc_cols %>%
# select(alc,one_of(alc_questions))
alc_cols <- alc[,alc_questions]
# create a more advanced plot matrix with ggpairs() : draws all possible scatter plots from the columns of a data frame, resulting in a scatter plot matrix.
p <- ggpairs(alc_cols, mapping = aes(col = sex, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))
# draw the plot
p
p <- ggpairs(alc_cols, mapping = aes(col = high_use, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))
# draw the plot
p
We can Immediately notice that,
#
# # access the 'tidyverse' packages dplyr and ggplot2
# library(dplyr); library(ggplot2)
#
# # define a new column alc_use by combining weekday and weekend alcohol use
# alc <- mutate(alc, alc_use = (Dalc + Walc) / 2)
#
# # initialize a plot of alcohol use
# g1 <- ggplot(data = alc, aes(x = alc_use, fill = sex))
#
# # define the plot as a bar plot and draw it
# g1 + geom_bar()
#
# # define a new logical column 'high_use'
# alc <- mutate(alc, high_use = alc_use > 2)
#
# # initialize a plot of 'high_use'
# g2 <- ggplot(alc, aes(high_use))
#
# # draw a bar plot of high_use by sex
# g2 + facet_wrap("sex") + geom_bar()
Use logistic regression to statistically explore the relationship between your chosen variables and the binary high/low alcohol consumption variable as the target variable.
model <- glm(high_use ~ age + freetime,
data = alc,
family = "binomial")
summary(model)
##
## Call:
## glm(formula = high_use ~ age + freetime, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.3711 -0.8959 -0.7071 1.3148 1.9878
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.30605 1.68463 -3.150 0.00163 **
## age 0.19413 0.09738 1.994 0.04620 *
## freetime 0.37360 0.12149 3.075 0.00210 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 438.07 on 367 degrees of freedom
## AIC: 444.07
##
## Number of Fisher Scoring iterations: 4
# print out the coefficients of the model
coef(model)
## (Intercept) age freetime
## -5.3060538 0.1941321 0.3735975
What we can conclude from these coefficients is that:
Next, we will add romantic to the model.
model2 <- glm(high_use ~ age + freetime + romantic,
data = alc,
family = "binomial")
summary(model2)
##
## Call:
## glm(formula = high_use ~ age + freetime + romantic, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.4318 -0.8620 -0.7207 1.3094 1.9994
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.47316 1.69354 -3.232 0.00123 **
## age 0.20842 0.09841 2.118 0.03418 *
## freetime 0.37709 0.12165 3.100 0.00194 **
## romanticyes -0.26045 0.25381 -1.026 0.30482
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 437.00 on 366 degrees of freedom
## AIC: 445
##
## Number of Fisher Scoring iterations: 4
# print out the coefficients of the model
coef(model2)
## (Intercept) age freetime romanticyes
## -5.4731577 0.2084192 0.3770938 -0.2604451
Very high p-value of romantic indicates that it is the least significant to high_use. We can conclude here that my previous hypothesis is not fully accurate since romantic does not show a strong correlation with high_use.
Next, we will add sex to the model. It will create my previously stated hypothesis model.
model3 <- glm(high_use ~ age + freetime + romantic + sex,
data = alc,
family = "binomial")
summary(model3)
##
## Call:
## glm(formula = high_use ~ age + freetime + romantic + sex, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5445 -0.8506 -0.6576 1.2234 2.1190
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.74785 1.72535 -3.331 0.000864 ***
## age 0.21575 0.09994 2.159 0.030870 *
## freetime 0.29151 0.12479 2.336 0.019491 *
## romanticyes -0.20455 0.25873 -0.791 0.429182
## sexM 0.80662 0.24192 3.334 0.000855 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 425.66 on 365 degrees of freedom
## AIC: 435.66
##
## Number of Fisher Scoring iterations: 4
# print out the coefficients of the model
coef(model3)
## (Intercept) age freetime romanticyes sexM
## -5.7478538 0.2157461 0.2915068 -0.2045533 0.8066158
Very low p-value of sex indicates that it is the most significant to high_use.
We can finalize the model as follows with the most significant variables.
model4 <- glm(high_use ~ age + freetime + sex,
data = alc,
family = "binomial")
summary(model4)
##
## Call:
## glm(formula = high_use ~ age + freetime + sex, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.4981 -0.8659 -0.6393 1.2428 2.0520
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.62126 1.71834 -3.271 0.001070 **
## age 0.20483 0.09903 2.068 0.038605 *
## freetime 0.28665 0.12457 2.301 0.021383 *
## sexM 0.81972 0.24134 3.396 0.000683 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 426.29 on 366 degrees of freedom
## AIC: 434.29
##
## Number of Fisher Scoring iterations: 4
# print out the coefficients of the model
coef(model4)
## (Intercept) age freetime sexM
## -5.6212569 0.2048304 0.2866508 0.8197204
let’s see the odds ratios with their CI’s.
OR <- model4 %>%
coef %>%
exp
CI <- model4 %>%
confint %>%
exp
## Waiting for profiling to be done...
m_OR <- cbind(OR, CI)
m_OR
## OR 2.5 % 97.5 %
## (Intercept) 0.003620088 0.0001161896 0.09987401
## age 1.227316904 1.0125010025 1.49443547
## freetime 1.331958997 1.0464192521 1.70714861
## sexM 2.269865003 1.4192715886 3.66181832
What we can conclude from these coefficients is that:
2X2 cross tabulation can give us an idea of the predictive power of the model.
probs <- predict(model4, type = "response") # Predict probability responses based on the variables
a_predict <- alc %>%
mutate(probability = probs) %>% # add them to our data.frame
mutate(prediction = probability > 0.5) # Compare those probabilities to our significance cutoff to identify the logical value
# Plot the 2X2 cross tabulation of the predictions
table(high_use = a_predict$high_use,
prediction = a_predict$prediction) %>%
prop.table %>%
addmargins()
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.66486486 0.03513514 0.70000000
## TRUE 0.25945946 0.04054054 0.30000000
## Sum 0.92432432 0.07567568 1.00000000
Approximately 70% of the values were predicted correctly.
Let’s compare the performance with some simple guessing model.
model5 <- glm(high_use ~ failures,
data = alc,
family = "binomial")
probs <- predict(model5, type = "response") # Predict probability responses based on the variables
a_predict <- alc %>%
mutate(probability = probs) %>% # add them to our data.frame
mutate(prediction = probability > 0.5) # Compare those probabilities to our significance cutoff to identify the logical value
# Plot the 2X2 cross tabulation of the predictions
table(high_use = a_predict$high_use,
prediction = a_predict$prediction) %>%
prop.table %>%
addmargins()
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.67567568 0.02432432 0.70000000
## TRUE 0.26756757 0.03243243 0.30000000
## Sum 0.94324324 0.05675676 1.00000000
In new model too, approximately 70% of the values were predicted correctly.
Cross-validation is a method of testing a predictive model on unseen data. In cross-validation, the value of a penalty (loss) function (mean prediction error) is computed on data not used for finding the model. Low value = good.
Cross-validation gives a good estimate of the actual predictive power of the model. It can also be used to compare different models or classification methods.
# the logistic regression model m and dataset alc (with predictions) are available
# define a loss function (average prediction error)
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
# compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = alc$probability)
## [1] NaN
# K-fold cross-validation
library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = model4, K = 10)
This week I have covered clustering and classification and data wrangling for next week exercise.
date()
## [1] "Sat Nov 27 18:33:06 2021"
# access the MASS package
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
# load the data
data("Boston")
# explore the dataset
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
This dataset contains information of 506 houses in Boston suburbs. This dataset has 14 columns. The data attributes include features such as per capita crime rate by town, proportion of residential land zoned for lots over 25,000 sq.ft., social and Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)) and so on. You may see the all the features explained in brief.
crim :per capita crime rate by town. zn : proportion of residential land zoned for lots over 25,000 sq.ft. indus : proportion of non-retail business acres per town. chas :Charles River dummy variable (= 1 if tract bounds river; 0 otherwise). nox : nitrogen oxides concentration (parts per 10 million). rm :average number of rooms per dwelling. age :proportion of owner-occupied units built prior to 1940. dis :weighted mean of distances to five Boston employment centres. rad :index of accessibility to radial highways. tax :full-value property-tax rate per $10,000. ptratio : pupil-teacher ratio by town. black :1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town. lstat : lower status of the population (percent). medv : median value of owner-occupied homes in $1000s.
# MASS, corrplot, tidyr and Boston dataset are available
library(tidyr)
library(corrplot)
## corrplot 0.92 loaded
# plot matrix of the variables
pairs(Boston)
# calculate the correlation matrix and round it
cor_matrix<-cor(Boston) %>% round(digits = 2)
# print the correlation matrix
cor_matrix
## crim zn indus chas nox rm age dis rad tax ptratio
## crim 1.00 -0.20 0.41 -0.06 0.42 -0.22 0.35 -0.38 0.63 0.58 0.29
## zn -0.20 1.00 -0.53 -0.04 -0.52 0.31 -0.57 0.66 -0.31 -0.31 -0.39
## indus 0.41 -0.53 1.00 0.06 0.76 -0.39 0.64 -0.71 0.60 0.72 0.38
## chas -0.06 -0.04 0.06 1.00 0.09 0.09 0.09 -0.10 -0.01 -0.04 -0.12
## nox 0.42 -0.52 0.76 0.09 1.00 -0.30 0.73 -0.77 0.61 0.67 0.19
## rm -0.22 0.31 -0.39 0.09 -0.30 1.00 -0.24 0.21 -0.21 -0.29 -0.36
## age 0.35 -0.57 0.64 0.09 0.73 -0.24 1.00 -0.75 0.46 0.51 0.26
## dis -0.38 0.66 -0.71 -0.10 -0.77 0.21 -0.75 1.00 -0.49 -0.53 -0.23
## rad 0.63 -0.31 0.60 -0.01 0.61 -0.21 0.46 -0.49 1.00 0.91 0.46
## tax 0.58 -0.31 0.72 -0.04 0.67 -0.29 0.51 -0.53 0.91 1.00 0.46
## ptratio 0.29 -0.39 0.38 -0.12 0.19 -0.36 0.26 -0.23 0.46 0.46 1.00
## black -0.39 0.18 -0.36 0.05 -0.38 0.13 -0.27 0.29 -0.44 -0.44 -0.18
## lstat 0.46 -0.41 0.60 -0.05 0.59 -0.61 0.60 -0.50 0.49 0.54 0.37
## medv -0.39 0.36 -0.48 0.18 -0.43 0.70 -0.38 0.25 -0.38 -0.47 -0.51
## black lstat medv
## crim -0.39 0.46 -0.39
## zn 0.18 -0.41 0.36
## indus -0.36 0.60 -0.48
## chas 0.05 -0.05 0.18
## nox -0.38 0.59 -0.43
## rm 0.13 -0.61 0.70
## age -0.27 0.60 -0.38
## dis 0.29 -0.50 0.25
## rad -0.44 0.49 -0.38
## tax -0.44 0.54 -0.47
## ptratio -0.18 0.37 -0.51
## black 1.00 -0.37 0.33
## lstat -0.37 1.00 -0.74
## medv 0.33 -0.74 1.00
# visualize the correlation matrix
corrplot(cor_matrix, method="circle", type="upper", cl.pos="b", tl.pos="d", tl.cex = 0.6)
# access the GGally and ggplot2 libraries
library(GGally)
library(ggplot2)
library(dplyr)
# create a more advanced plot matrix with ggpairs() : draws all possible scatter plots from the columns of a data frame, resulting in a scatter plot matrix.
p <- ggpairs(Boston, mapping = aes( alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)),upper = list(continuous = wrap("cor", size = 2)))
# draw the plot
p
We can Immediately notice that,
# center and standardize variables
boston_scaled <- scale(Boston)
# summaries of the scaled variables
summary(boston_scaled)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
# class of the boston_scaled object
class(boston_scaled)
## [1] "matrix" "array"
# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)
As we can observe above from the summary of the variables, we can notice that the distribution of values of each column variables have been scaled such that mean of the column data is zero and standard deviation is 1. Max and min values corresponding to a particular column too have adjusted accordingly.
We can create a categorical variable from a continuous one. There are many ways to to do that. Let’s choose the variable crim (per capita crime rate by town) to be our factor variable. We want to cut the variable by quantiles to get the high, low and middle rates of crime into their own categories.
# summary of the scaled crime rate
summary(boston_scaled$crim)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.419367 -0.410563 -0.390280 0.000000 0.007389 9.924110
# create a quantile vector of crim and print it
bins <- quantile(boston_scaled$crim)
bins
## 0% 25% 50% 75% 100%
## -0.419366929 -0.410563278 -0.390280295 0.007389247 9.924109610
# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
# look at the table of the new factor crime
table(crime)
## crime
## low med_low med_high high
## 127 126 126 127
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)
# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
Divide and conquer: train and test sets When we want to use a statistical method to predict something, it is important to have data to test how well the predictions fit. Splitting the original data to test and train sets allows us to check how well our model works.
The training of the model is done with the train set and prediction on new data is done with the test set. This way you have true classes / labels for the test data, and you can calculate how well the model performed in prediction.
Time to split our data!
# number of rows in the Boston dataset
n <- nrow(boston_scaled)
# choose randomly 80% of the rows
ind <- sample(n, size = n * 0.8)
# create train set
train <- boston_scaled[ind,]
# create test set
test <- boston_scaled[-ind,]
Linear Discriminant analysis is a classification (and dimension reduction) method. It finds the (linear) combination of the variables that separate the target variable classes. The target can be binary or multiclass variable. Linear discriminant analysis is closely related to many other methods, such as principal component analysis.
We are going to fit a linear discriminant analysis using the categorical crime rate as the target variable and all the other variables in the dataset as predictor variables.
# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)
# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2425743 0.2549505 0.2475248 0.2549505
##
## Group means:
## zn indus chas nox rm age
## low 0.9498002 -0.8818109 -0.07145661 -0.8659304 0.37621485 -0.8984451
## med_low -0.1182054 -0.2734839 -0.04298342 -0.5733106 -0.17029730 -0.3710215
## med_high -0.3826200 0.3035152 0.23949396 0.4694169 0.08224339 0.4364332
## high -0.4872402 1.0170891 -0.08120770 1.0256387 -0.40231472 0.8020070
## dis rad tax ptratio black lstat
## low 0.8539696 -0.6642849 -0.7228009 -0.41591419 0.38432040 -0.757562153
## med_low 0.4029215 -0.5514748 -0.5035004 -0.06646038 0.31325200 -0.098249669
## med_high -0.4281234 -0.4030438 -0.2577898 -0.35360409 0.08591139 0.008827365
## high -0.8498399 1.6384176 1.5142626 0.78111358 -0.82099337 0.940158948
## medv
## low 0.46685612
## med_low -0.03523592
## med_high 0.15887541
## high -0.70724990
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.08819713 0.68906937 -1.04260245
## indus 0.06935182 -0.24795283 0.30320060
## chas -0.10254663 -0.02407161 0.06542190
## nox 0.38130085 -0.77277415 -1.13004006
## rm -0.08139321 -0.10347967 -0.10912952
## age 0.18905676 -0.38136785 -0.07757152
## dis -0.11014700 -0.32479182 0.47724071
## rad 3.16578129 1.08657142 -0.03249184
## tax 0.03620004 -0.24855227 0.35464095
## ptratio 0.09802962 0.02019188 -0.20553209
## black -0.12778508 0.01535955 0.07587615
## lstat 0.21921711 -0.16160542 0.50951386
## medv 0.22224528 -0.47389063 -0.10734409
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9441 0.0413 0.0146
LDA can be visualized with a biplot.
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "orange", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(train$crime)
# plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)
It can be observed that rad has the most influence over the separation of the model. Moreover, it implies that rad variable is more likely to influence the most on the crim variable.
Save the crime categories from the test set and then remove the categorical crime variable from the test dataset. Then predict the classes with the LDA model on the test data. Cross tabulate the results with the crime categories from the test set. Comment on the results.
# save the correct classes from test data
correct_classes <- test$crime
# remove the crime variable from test data
test <- dplyr::select(test, -crime)
# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)
# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 19 7 3 0
## med_low 7 11 5 0
## med_high 0 17 8 1
## high 0 0 0 24
We can observe that diagonals of the cross tabulated table (14,18,11,22) represent correctly classified crime categories where as off-diagonals represent the mis-classifications. Correct classifications will be interpreting low as low, med low as med low, med high as med high and high as high. All others are mis-classifications. Correct classifications = 14+18+11+22 = 65 Total mis-classifications = 5+16+1+2+3+10= 37
Reload the Boston dataset and standardize the dataset. Calculate the distances between the observations.
# load MASS and Boston
library(MASS)
data('Boston')
# center and standardize variables
boston_scaled <- scale(Boston)
# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)
# euclidean distance matrix
dist_eu <- dist(Boston)
# look at the summary of the distances
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.119 85.624 170.539 226.315 371.950 626.047
# manhattan distance matrix
dist_man <- dist(Boston, method = 'manhattan')
# look at the summary of the distances
summary(dist_man)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.016 149.145 279.505 342.899 509.707 1198.265
K-means is maybe the most used and known clustering method. It is an unsupervised method, that assigns observations to groups or clusters based on similarity of the objects. In the previous exercise we got a hang of distances. The kmeans() function counts the distance matrix automatically, but it is good to know the basics. Let’s cluster a bit!
# k-means clustering
km <-kmeans(Boston, centers = 3)
# plot the Boston dataset with clusters
pairs(Boston, col = km$cluster)
K-means needs the number of clusters as an argument. There are many ways to look at the optimal number of clusters and a good way might depend on the data you have.
One way to determine the number of clusters is to look at how the total of within cluster sum of squares (WCSS) behaves when the number of cluster changes. When you plot the number of clusters and the total WCSS, the optimal number of clusters is when the total WCSS drops radically.
K-means might produce different results every time, because it randomly assigns the initial cluster centers. The function set.seed() can be used to deal with that.
Steps are as follows.
set.seed(123)
# determine the number of clusters
k_max <- 10
# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(Boston, k)$tot.withinss})
# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')
The optimal number of clusters is when the value of total WCSS changes radically. In this case, we observe that radical change from 1 to 2. Hence, two clusters would seem optimal.
Next, Run kmeans() again with two clusters and visualize the results.
# k-means clustering
km <-kmeans(Boston, centers = 2)
# plot the Boston dataset with clusters
pairs(Boston, col = km$cluster)
Perform k-means on the original Boston data with some reasonable number of clusters (> 2). Remember to standardize the dataset. Then perform LDA using the clusters as target classes. Include all the variables in the Boston data in the LDA model. Visualize the results with a biplot (include arrows representing the relationships of the original variables to the LDA solution). Interpret the results. Which variables are the most influencial linear separators for the clusters?
data("Boston")
boston_scaled <- Boston %>%
scale %>%
as.data.frame
set.seed(123)
# k-means clustering
km <- kmeans(boston_scaled, centers = 3)
# clusters
boston_b <- boston_scaled %>%
mutate(crim = as.factor(km$cluster))
# number of rows in the Boston dataset
n <- nrow(boston_b)
# choose randomly 80% of the rows
ind <- sample(n, size = n * 0.8)
# create train set
train <- boston_b
# create test set
test <- boston_b
correct_classes <- test$crim
test <- dplyr::select(test, -crim)
# perform LDA using the clusters as target classes
lda.fit <- lda(crim ~ ., data = train)
# target classes as numeric
classes <- as.numeric(train$crim)
# plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)
It can be observed that rad has the most influence for the separation of the model.
Moreover, it implies that rad variable is influencing the most into cluster 1.
Next, age variable is the second most influential variable for the separation of the model.
age variable is highly likely to separate cluster 2 and 3 with respect to LD 2 axis.
(more chapters to be added similarly as we proceed with the course!)